This dataset has been sourced from Kaggle.It contains marketing data of 2240 customers and includes 28 features on customer profiles, products purchased, campaign details and channel performance.
options(warn=-1)
library(knitr)
library(DT)
#reading the data
setwd("C:/Users/heena/OneDrive/Desktop/CS 544/Project")
#df <- read.csv(file.choose(),header = T)
df<-read.csv("marketing_data.csv")
#seeing first 6 rows of the data
head(df)
#checking dimensions of data
dim(df)
#feature names
colnames(df)
#checking column datatypes
sapply(df,class)#kable(head(df),caption="Marketing Data",format="markdown")
datatable(df)Our objective was to analyze the customer data and answer the following questions:
We converted the Income variable from character to numeric datatype. We customized the levels for Education variable and converted it to an ordered factor (High school < Under.graduate < Master < PhD). We also converted the columns Country,Marital status, Kidhome, Teenhome, whether the customer complained or not and whether the customer accepted any of the 5 campaigns to factor data type (unordered). Null values were found in the Education column and were removed. We created a function to check for outliers using the Inter Quartile Range criteria. Outliers were found in certain columns but we decided to keep them since they show important variability in the data.These values were quite close to each other and did not seem like data entry errors. The income and age columns were found to have some very extreme values, we found the row number for the same and created a subset of the data excluding those particular rows. A column was added for the customer’s age (calculated from Year of Birth). Expenditure on various products was added to give a new feature, total expenditure by every customer.
library("tidyverse")
library("plotly")
df$Income<-parse_number(df$Income)
df$Education[df$Education=='Basic']<-'High.school'
df$Education[df$Education=='Graduation']<-'Under.graduate'
df$Education[df$Education=='2n Cycle']<-'Master'
df$Education<-factor(df$Education,ordered=T,levels=c('High school','Under.graduate','Master','PhD'))
df$Marital_Status<-factor(df$Marital_Status)
df$Country<-factor(df$Country)
df$Kidhome<-factor(df$Kidhome)
df$Teenhome<-factor(df$Teenhome)
df$Complain<-factor(df$Complain)
df$Response<-factor(df$Response)
df$AcceptedCmp1<-factor(df$AcceptedCmp1)
df$AcceptedCmp2<-factor(df$AcceptedCmp2)
df$AcceptedCmp3<-factor(df$AcceptedCmp3)
df$AcceptedCmp4<-factor(df$AcceptedCmp4)
df$AcceptedCmp5<-factor(df$AcceptedCmp5)
colSums(is.na(df))
df2<-df[complete.cases(df),] #removing NAs
colSums(is.na(df2))
outliers<- function(column){
lower.bound<-quantile(column,0.25) - 1.5*IQR(column)
upper.bound<-quantile(column,0.75) + 1.5*IQR(column)
return(column[column<lower.bound | column>upper.bound])
}
outliers(df2$Income)
which(grepl(666666, df2$Income))
#remove row 514 (specifying data.frame, row index)
df2 <- df2[-c(514),]
outliers(df2$MntWines)
outliers(df2$MntFishProducts)
outliers(df2$MntGoldProds)
outliers(df2$MntMeatProducts)
outliers(df2$MntSweetProducts)
outliers(df2$MntFruits)
outliers(df2$NumDealsPurchases)
outliers(df2$NumCatalogPurchases)
outliers(df2$NumStorePurchases)
outliers(df2$NumWebPurchases)
df2$Age<-2021-df2$Year_Birth
#checking outliers in age
outliers(df2$Age)
which(grepl(128, df2$Age))
which(grepl(122, df2$Age))
which(grepl(121, df2$Age))
df2 <- df2[-c(495,797,2149),]
df2$Total.expenditure<-df2$MntWines+df2$MntFruits+df2$MntGoldProds+df2$MntMeatProducts+df2$MntSweetProducts+df2$MntFishProducts
head(df2)Is there an influence of the number of kids a customer has on their total expenditure? The number of kids in a household often has a great impact on how much a customer spends. Prior knowledge of this can help companies come up with better and more customized marketing schemes. Kids represent an important demographic to marketers. In addition to their own purchasing power, they also influence their parents’ buying decisions and expenditure. Below is a boxplot and histogram for the distribution of total expenses based on the number of kids a customer has.
plot_ly (df2,x=df2$Total.expenditure[df2$Kidhome=='0'],y = ~df2$Kidhome[df2$Kidhome=='0'], type="box", name = '0 kids',color=I("greenyellow")) %>%
add_trace(x=df2$Total.expenditure[df2$Kidhome=='1'],y = ~df2$Kidhome[df2$Kidhome=='1'], name = '1 kid',color=I("orange")) %>%
add_trace(x=df2$Total.expenditure[df2$Kidhome=='2'],y = ~df2$Kidhome[df2$Kidhome=='2'],name = '2 kids',color=I("hotpink")) %>%
layout(xaxis = list(title = 'Total expenditure')) %>%
layout(yaxis = list(title = 'Kids'))%>%
layout(title = 'Boxplot for total expenditure based on number of kids at home',plot_bgcolor='#e5ecf6',
xaxis = list(
zerolinecolor = '#ffff',
zerolinewidth = 2,
gridcolor = 'ffff',range=c(0,3000)),
yaxis = list(
zerolinecolor = '#ffff',
zerolinewidth = 2,
gridcolor = 'ffff')) -> p
pdf20<-df2[df2$Kidhome==0,]
df21<-df2[df2$Kidhome==1,]
df22<-df2[df2$Kidhome==2,]
fig1 <- plot_ly(x = df20$Total.expenditure, type = 'histogram',color=I("greenyellow"),name="0 kids",opacity=0.5)%>%
layout(xaxis=list(range=c(0,3000)),yaxis=list(range=c(0,300)) )
fig2 <- plot_ly(x=df21$Total.expenditure, type ='histogram',color=I("orange"),name="1 kid",opacity=0.5) %>%
layout(xaxis=list(range=c(0,3000)),yaxis=list(range=c(0,300)) )
fig3 <- plot_ly(x=df22$Total.expenditure, type='histogram',color=I("hotpink"),name="2 kids",opacity=0.5) %>%
layout(xaxis=list(range=c(0,3000)),yaxis=list(range=c(0,300)) )
fig <- subplot(fig1, fig2, fig3, nrows = 3) %>%
layout(title = list(text = "Histograms for total expenditure based on number of kids at home"),
plot_bgcolor='#e5ecf6'
)
figdf2%>% group_by(Kidhome)%>% summarise(sd(Total.expenditure))
median(df2$Total.expenditure[df2$Kidhome=='0'])
mean(df2$Total.expenditure[df2$Kidhome=='0'])df2%>% group_by(Kidhome)%>% summarise(sd(Total.expenditure))
median(df2$Total.expenditure[df2$Kidhome=='1'])
mean(df2$Total.expenditure[df2$Kidhome=='1'])df2%>% group_by(Kidhome)%>% summarise(sd(Total.expenditure))
median(df2$Total.expenditure[df2$Kidhome=='2'])
mean(df2$Total.expenditure[df2$Kidhome=='2'])We try to explore the effect of the number of teenagers in a household on total expenditure. Marketing to young people is a strategy used by a wide range of companies. Knowing their most likely customers can help businesses better target their sales strategies by focusing on the majority demographic or offering discounts and offers to less likely subgroups.
plot_ly(df2,x=df2$Total.expenditure[df2$Teenhome=='0'],y = ~df2$Teenhome[df2$Teenhome=='0'], type="box", name = '0 teens',color=I("greenyellow")) %>%
add_trace(x=df2$Total.expenditure[df2$Teenhome=='1'],y = ~df2$Teenhome[df2$Teenhome=='1'], name = '1 teens',color=I("orange")) %>%
add_trace(x=df2$Total.expenditure[df2$Teenhome=='2'],y = ~df2$Teenhome[df2$Teenhome=='2'],name = '2 teens',color=I("hotpink")) %>%
layout(xaxis = list(title = 'Total expenditure')) %>%
layout(yaxis = list(title = 'Teenagers'))%>%
layout(title = 'Boxplot for total expenditure based on number of teenagers at home',plot_bgcolor='#e5ecf6',
xaxis = list(
zerolinecolor = '#ffff',
zerolinewidth = 2,
gridcolor = 'ffff',range=c(0,3000)),
yaxis = list(
zerolinecolor = '#ffff',
zerolinewidth = 2,
gridcolor = 'ffff')) -> p2
p2df20b<-df2[df2$Teenhome==0,]
df21b<-df2[df2$Teenhome==1,]
df22b<-df2[df2$Teenhome==2,]
fig1b <- plot_ly(x = df20b$Total.expenditure, type = 'histogram',color=I("greenyellow"),name="0 teens",opacity=0.5)%>%
layout(xaxis=list(range=c(0,3000)),yaxis=list(range=c(0,400)) )
fig2b <- plot_ly(x=df21b$Total.expenditure, type ='histogram',color=I("orange"),name="1 teens",opacity=0.5) %>%
layout(xaxis=list(range=c(0,3000)),yaxis=list(range=c(0,400)) )
fig3b <- plot_ly(x=df22b$Total.expenditure, type='histogram',color=I("hotpink"),name="2 teens",opacity=0.5) %>%
layout(xaxis=list(range=c(0,3000)),yaxis=list(range=c(0,400)) )
figb <- subplot(fig1b, fig2b, fig3b, nrows = 3) %>%
layout(title = list(text = "Histograms for total expenditure based on number of teens at home"),
plot_bgcolor='#e5ecf6')
figbWhich country has the highest expenditure? Global marketing focuses a product on the needs of potential buyers in various countries. One of the key aspects in global marketing is knowing the consumer and connecting with the audience. Knowing who needs your product, what form they need it in, and how to market it in a way that strengthens the brand are essential to every business. It is of utmost importance to identify countries where the business’ product might be successful, and then localize the brand to reflect the needs of those communities.
req = aggregate(df2$Total.expenditure, by=list(Category=df2$Country), FUN=sum)
req = as.data.frame.matrix(req)
req$Category = factor(req$Category, levels = req$Category[order(req$x, decreasing = TRUE)])
req %>% plot_ly(x = ~Category, y = ~x, type = 'bar',opacity=0.5,color = I("lightslategrey")) %>% layout(title = "Barplot for country wise expenditure",
xaxis = list(title = "Country"),
yaxis = list(title = "Expenditure",range=c(0,700000)))Singapore was found to have highest expenditure, followed by Saudi Arabia and Canada. Montenagro has the lowest expenditure. Top 5 countries in terms of expenditure:
Which country spends the most on each product category? We try to explore how much each country spends on various products, including Wine, Fruits, Sweets, Meat, Gold and Fish.
fig1c <- plot_ly(x = df2$Country, y =df2$MntFishProducts, type = 'bar', opacity=0.5,color = I("greenyellow"),name="Fish")
fig2c <- plot_ly(x = df2$Country, y = df2$MntFruits, type = 'bar', opacity=0.5,color = I("purple"),name="Fruits")%>%
layout(yaxis=list(range=c(0,30000)))
fig3c <- plot_ly(x = df2$Country, y=df2$MntGoldProds, type = 'bar', opacity=0.5,color = I("orange"),name="Gold")%>%
layout(yaxis=list(range=c(0,50000)))
fig4c <- plot_ly(x =df2$Country, y=df2$MntMeatProducts, type = 'bar', opacity=0.5,color = I("magenta"), name="Meat")%>%
layout(yaxis=list(range=c(0,200000)))
fig5c <- plot_ly(x = df2$Country, y=df2$MntSweetProducts, type = 'bar', opacity=0.5,color = I("yellow"),name="Sweets")
fig6c <- plot_ly(x =df2$Country, y=df2$MntWines, type = 'bar', opacity=0.5,color = I("salmon"), name="Wines")%>%
layout(yaxis=list(range=c(0,400000)))
figc <- subplot(fig1c, fig2c, fig3c, fig4c, fig5c, fig6c, nrows = 3) %>%
layout(title = list(text = "Barplots for expenditure of countries on different products"),
plot_bgcolor='#e5ecf6')
figc#Grouped barplot
df2$Country <- factor(df2$Country, levels = unique(df2$Country)[order(df2$MntWines, decreasing = FALSE)])
fig1d <- plot_ly(data=df2, x = ~Country, y = ~MntFishProducts, type = 'bar', opacity=0.6,color = I("greenyellow"),name="Fish")
fig2d <- fig1d %>% add_trace(y = ~MntFruits, name = 'Fruits', opacity=0.6,color = I("purple"))
fig3d <- fig2d %>% add_trace(y = ~MntGoldProds, name = 'Gold', opacity=0.6,color = I("orange"))
fig4d <- fig3d %>% add_trace(y = ~MntMeatProducts, name = 'Meat', opacity=0.6,color = I("magenta"))
fig5d <- fig4d %>% add_trace(y = ~MntSweetProducts, name = 'Sweets', opacity=0.6,color = I("yellow"))
fig6d <- fig5d %>% add_trace(y = ~MntWines, name = 'Wines', opacity=0.6,color = I("salmon"))
fig7d <- fig6d %>% layout(title = list(text = "Grouped Barplot for expenditure of countries on different products"),
plot_bgcolor='#e5ecf6',yaxis = list(title = 'Expenditure',range=c(0,350000)), barmode = 'grouped')
fig7dIt is imperative to understand the distribution of different educational status and marital statuses of customers to examine if there is a group that dominates the dataset.
#Pie chart for education status and marital status
fig <- plot_ly(mai = c(0, 0, 0, 0),marker = list(colors=c("peachpuff","lightsteelblue", "pink","paleturquoise", "lightgray","plum","mediumslateblue", "gray","lavenderblush")))
fig <- fig %>% add_pie(data = count(df2, Education), labels = ~Education,
values = ~n,name = "Education",domain = list(x = c(0, 0.4), y = c(0.2, 1)))
fig <- fig %>% add_pie(data = count(df2, Marital_Status), labels = ~Marital_Status,
values = ~n, name = "Marital_Status", domain = list(x = c(0.6, 1), y = c(0.2, 1)))
fig <- fig %>% layout(title = "Pie Charts for distribution of data based on Education and Marital Status", showlegend = T,strip.white=FALSE,xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE), yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))
figCustomers of which education level spent the most in store vs on the web? Though we live in a digital world, In store shopping is still very much alive. Despite online shopping becoming much more prevalent, a vast number of shoppers still make purchases in-store. Whether customers are shopping in-store or online, each type offers its own benefits. It is important to make sure and analyze if this shift happens based on the education status of the customer or not. Utilizing this target demographic analysis will help the business save time and money by discovering if the customer is more likely to make a purchase online or in a store.
dfsd1 <-unique(df2$Education)
dfsd2<-df2[df2$Education== 'Under.graduate',]
dfsd3<-df2[df2$Education=='Master',]
dfsd4<-df2[df2$Education=='PhD',]
Education <- c("Under.graduate", "Master", "PhD")
NumWebPurchases <- c(sum(dfsd2$NumWebPurchases), sum(dfsd3$NumWebPurchases), sum(dfsd4$NumWebPurchases))
NumStorePurchases <- c(sum(dfsd2$NumStorePurchases), sum(dfsd3$NumStorePurchases) , sum(dfsd4$NumStorePurchases))
df <- data.frame(Education, NumWebPurchases, NumStorePurchases)
df$Education <- factor(df$Education, levels = unique(df$Education)[order(df$NumStorePurchase, decreasing = FALSE)])
p <- plot_ly(df, x = ~Education, y = ~NumWebPurchases, type = 'bar', name = 'Number of WebPurchases',color=I("lightpink")) %>%
add_trace(y = ~NumStorePurchases, name = 'Num of StorePurchases',opacity=0.4,color=I('greenyellow')) %>%
layout(yaxis = list(title = 'Count'), barmode = 'group')
pCustomers of what marital status spent the most in store vs on the web?
This analysis examines whether the demographic profile of a customer affects their attitude towards shopping behavior. The demographic profile variable that could be linked to attitude was chosen as marital status. The answer to this question would provide some understanding to service providers and businesses on the effect of demographic profile on online and in store shopping. This would help them in finding and implementing suitable strategies to enhance targeted shopping experience.
dfms1 <-unique(df2$Marital_Status)
dfms2<-df2[df2$Marital_Status== 'Divorced',]
dfms3<-df2[df2$Marital_Status=='Single',]
dfms4<-df2[df2$Marital_Status=='Married',]
dfms5<-df2[df2$Marital_Status=='Together',]
dfms6<-df2[df2$Marital_Status=='Widow',]
dfms7<-df2[df2$Marital_Status=='YOLO',]
dfms8<-df2[df2$Marital_Status=='Alone',]
dfms9<-df2[df2$Marital_Status=='Absurd',]
Marital_Status <- c("Divorced", "Single" , " Married" , "Together","Widow" , "YOLO" , "Alone" , "Absurd")
NumWebPurchasesms <- c(sum(dfms2$NumWebPurchases), sum(dfms3$NumWebPurchases),sum(dfms4$NumWebPurchases) ,sum(dfms5$NumWebPurchases) ,sum(dfms6$NumWebPurchases) ,sum(dfms7$NumWebPurchases),sum(dfms8$NumWebPurchases),sum(dfms9$NumWebPurchases))
NumStorePurchasesms <- c(sum(dfms2$NumStorePurchases), sum(dfms3$NumStorePurchases) , sum(dfms4$NumStorePurchases),sum(dfms5$NumStorePurchases) ,sum(dfms6$NumStorePurchases),sum(dfms7$NumStorePurchases),sum(dfms8$NumStorePurchases),sum(dfms9$NumStorePurchases))
dfMS <- data.frame(Marital_Status, NumWebPurchasesms, NumStorePurchasesms)
dfMS$Marital_Status <- factor(dfMS$Marital_Status, levels = unique(dfMS$Marital_Status)[order(dfMS$NumStorePurchasesms, decreasing = FALSE)])
p_marr_status <- plot_ly(dfMS, x = ~Marital_Status, y = ~NumWebPurchasesms, type = 'bar', name = 'Number of WebPurchases',color=I("darkgray")) %>%
add_trace(y = ~NumStorePurchasesms, name = 'Num of StorePurchases',opacity=0.6,color=I("salmon")) %>%
layout(yaxis = list(title = 'Count'), barmode = 'group')
p_marr_statusSampling is a technique used to select a subset of the population to make statistical inferences from them and draw conclusions about the whole population. For example, if a company would like to know the customer preferences for an entire country, it is not plausible to conduct a research study that involves everyone. Thus a representative and unbiased sample of this population is used instead. There are different Sampling techniques that can be used. For this analysis, we have used Simple Random Sampling with Replacement, Simple Random Sampling without Replacement, Systematic Sampling and Stratified Sampling. We have demonstrated the use of these techniques by plotting a histogram for the Income feature (Sample size was taken as 80).
In a simple random sampling (SRS), every item of the frame (list of items that define a population) has an equal chance of being selected in the sample. Samples can be chosen with or without replacement. Here ‘n’ represents the sample size and ‘N’ represents number of items in the frame. The probability of selecting each item in SRS with replacement is 1/N. In case of SRS without replacement, the probability of selecting the first element is 1/N, second element is 1/N-1, and so on.
In Systematic sampling, the frame is divided into groups.If ‘n’ is the sample size and ‘N’ is the size of the frame, each group has N/n=k items.Systematic sampling involves a random start (i.e. item from the first group of k elements is randomly selected). After this, every kth element from the frame is selected to be a part of the sample. We can also compute the probability for each item to be included in the sample with probabilities proportional to the size. Here we compute probabilities based on total expenditure of each customer.
In Stratified Sampling, the frame is divided into subgroups based on distinct characteristics of the data. These subgroups are called ‘strata’. Each strata is then sampled as an independent sub-population, out of which individual elements can be randomly selected and then combined in order to form the sample.
#Sampling
library("sampling")
#Plotting histogram for original data
set.seed(123)
fig8 <- plot_ly(x = df2$Income, type = "histogram",histnorm = "probability",color=I("darkgray"),name="Histogram for all data",opacity=0.8)%>% layout( xaxis=list(zerolinecolor = '#fff',zerolinewidth = 2,gridcolor = 'ffff',range=c(0,180000),title="Income"),yaxis=list(range=c(0,0.040),title="Density"))
fig8%>%layout(title="Histogram for all data")#Simple Random Sampling with replacement
set.seed(123)
N <- nrow(df2)
n <- 80
s<-srswr(n,N)
#s
#s[s!=0]
rows<-rep((1:nrow(df2))[s!=0],s[s!=0])
srs<-df2[rows,]
fig8a <- plot_ly(x = srs$Income, type = "histogram",histnorm = "probability",color=I("greenyellow"),name="Srswr",opacity=0.5)%>% layout(xaxis=list(zerolinecolor = '#fff',zerolinewidth = 2,gridcolor = 'ffff',range=c(0,180000),title="Income"),yaxis=list(range=c(0,0.25),title="Density"))
#Simple Random Sampling without replacement
set.seed(123)
s2<-srswor(n,N)
#s2
#s2[s2!=0]
rows2<-(1:nrow(df2))[s2!=0]
srs2<-df2[rows2,]
fig8b<- plot_ly(x = srs2$Income, type = "histogram",histnorm = "probability",color=I("orange"),name="Srswor",opacity=0.8)%>% layout(xaxis=list(zerolinecolor = '#fff',zerolinewidth = 2,gridcolor = 'ffff',title="Income",range=c(0,180000)),yaxis=list(range=c(0.25),title="Density"))
#Systematic Sampling
set.seed(123)
k<-ceiling(N/n)
#k
r<-sample(k,1)
#r
sys<-seq(r,by=k,length=n)
#sys
s3<-df2[sys,]
#s3
fig8c <- plot_ly(x = s3$Income, type = "histogram",histnorm = "probability",color=I("pink"),opacity=0.5,name="Systematic Sampling")%>% layout(xaxis=list(zerolinecolor = '#fff',zerolinewidth = 2,gridcolor = 'ffff',title="Income",range=c(0,180000)),yaxis=list(title="Density",range=c(0,0.25)))
#Systematic Sampling (Unequal probabilities)
set.seed(123)
pick<-inclusionprobabilities(df2$Total.expenditure,80)
s3b<-UPsystematic(pick)
s3c<-df2[s3b!=0,]
fig8c2 <- plot_ly(x = s3c$Income, type = "histogram",histnorm = "probability",color=I("blue"),opacity=0.5,name="Systematic Sampling (unequal prob)")%>% layout(xaxis=list(zerolinecolor = '#fff',zerolinewidth = 2,gridcolor = 'ffff',title="Income",range=c(0,180000)),yaxis=list(range=c(0.25),title="Density"))
#Stratified Sampling
set.seed(123)
df2$Education<-factor(df2$Education,ordered=T,levels=c('Under.graduate','Master','PhD'))
freq<-table(df2$Education)
st.sizes<-80*(freq/sum(freq)) #calculating sample sizes for unequal strata
#st.sizes
s4<-sampling::strata(df2,stratanames=c("Education"),method="srswor",size=st.sizes)
#s4
d<-getdata(df2,s4)
fig8d <- plot_ly(x = d$Income, type = "histogram",histnorm = "probability",color=I("magenta"),name="Stratified sampling",opacity=0.5)%>% layout(xaxis=list(zerolinecolor = '#fff',zerolinewidth = 2,gridcolor = 'ffff',title="Income",range=c(0,180000)),yaxis=list(title="Density",range=c(0,0.25)))
s <- subplot(fig8a, fig8b, fig8c, fig8c2, fig8d, nrows = 3, shareY = TRUE)%>%layout(title='Sampling techniques')
smean(df2$Income) #population mean
mean(srs$Income) #srswr
mean(srs2$Income) #srswor
mean(s3$Income) #systematic
mean(s3c$Income) #systematic with unequal probabilities
mean(d$Income) #stratified
sd(df2$Income) #population standard deviation
sd(srs$Income) #srswr
sd(srs2$Income) #srswor
sd(s3$Income) #systematic
sd(s3c$Income) #systematic with unequal probabilities
sd(d$Income) #stratifiedThe mean of the samples extracted using the various sampling techniques was found to be very close to the population mean, except for systematic sampling with unequal probabilities. Standard deviation of the samples is also close to the population’s standard deviation, except in the case of systematic sampling. Hence we can use these samples in order to draw inferences about our population.
Is there a relationship between a customer’s income and their expenditure?
The relationship between income and expenditure is often called a consumption schedule. It is used to describe economic trends in the household sector. When there is more money or anticipation of income, more goods are purchased by consumers. Meaning money is spent on expenditures, at times, even if there isn’t enough income to cover them. This is a common economic principal used to describe spending trends for national and world economies. A business should consider the relationship between consumption and savings to extract data on buyer trends within its own industry. Hence, this question will help to make a business decision about the relationship of expenditure vs income for the given dataset.
Using SLR, we assert a straight line on the scatterplot that represents the best fitting line to the data that captures the pattern of the relationship.We build a data model that allows us to:
The equation for the simple linear regression line is given by y = β0 + β1x, where y is the response or dependent variable, x is the explanatory or independent variable, β0 is the intercept (the value of y when x = 0), β1 is the slope (the expected change in y for each one-unit change in x).The most common way to find the best fit line is to minimize the sum of the squares of the distances between the points and the regression line. This approach is called the least-squares method.
pal <- c("peachpuff","deeppink","darkmagenta","blueviolet","yellow","darkolivegreen")
dfexp <- df2[,c(29,5,30)]
cor(df2$Total.expenditure,df2$Income) #r## [1] 0.7903145
mdl<-lm(df2$Total.expenditure~df2$Income)
summary(mdl) #R2=0.6244##
## Call:
## lm(formula = df2$Total.expenditure ~ df2$Income)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2986.6 -225.7 -40.5 208.0 2244.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.696e+02 2.141e+01 -26.60 <2e-16 ***
## df2$Income 2.256e-02 3.766e-04 59.89 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 370.5 on 2156 degrees of freedom
## Multiple R-squared: 0.6246, Adjusted R-squared: 0.6244
## F-statistic: 3587 on 1 and 2156 DF, p-value: < 2.2e-16
dfexp %>%
plot_ly(x = ~Income, y = ~Total.expenditure,colors = pal) %>%
add_markers(color = ~Age) %>%
add_lines(x = ~Income, y = fitted(mdl))fitted(mdl) #predicted values
resid(mdl) #residuals of the model (difference between actual and predicted values)rmse<-sqrt(mean(sum((df2$Total.expenditure-fitted(mdl))^2)))
cat("RMSE is", rmse)## RMSE is 17202.95
mae<-mean(abs(df2$Total.expenditure-fitted(mdl)))
cat("MAE is", mae) #mean absolute error## MAE is 273.2538
We created a linear model to predict total expenditure based on their income. Here income was taken as the independent or explanatory variable and total expenditure was the response or dependent variable.
As income increases, expenditure was seen to increase as well. Curved/ quadratic relationship was observed from the scatterplot. Looking at the scatterplot,
Correlation coefficient, r was found to be 0.79. Thus we can say that there is a positive association, since correlation coefficient > 0.Also 0.79 is closer to 1, hence explanatory and response variables are decently correlated, association is moderately strong.
On creating a linear regression model, we found that 62.4% of the variation in Total expenditure is explained by income. Coefficient B1=0.0225 and intercept B0=-569.2.
On a $1 increase in income, total expenditure increases by $0.0225. Root mean square error is 17202 and Mean absolute error is 273.25.
Linear equation: y= -569.2 + 0.0225x, where y=total expenditure and x=income. Applying transformations on the data and using a generalised data model could help improve these results.
H0 (null hypothesis): B1=0 (no linear association between Income and Total expenditure)
Ha (alternate hypothesis): B1!=0 (there is a linear association between x and y) alpha or significance level is 0.05.
One sample two sided t test.
t=B1/(sqrt(sum(sq(actual y-predicted y))/n-2) / sqrt(sum(sq(actual x-mean x))))
degree.of.freedom<-nrow(df)-2 degree.of.freedom [1]2238
If |t|>=critical t value, reject null hypothesis. Otherwise, do not reject the null hypothesis.
crit.t <- qt(0.975,degree.of.freedom) #1.96
Decision rule: Reject H0 if |t|>= 1.96, or else do not reject.
summary(mdl) t= 59.78
As t > critical value of t, reject null hypothesis.
We have sufficient evidence at alpha=0.05 level that B1!=0. There is a significant linear association between income and total expenditure.
confint(mdl,level=0.95)## 2.5 % 97.5 %
## (Intercept) -611.5544338 -527.57826113
## df2$Income 0.0218181 0.02329524
We can say with 95% confidence that true value of B1 lies between 0.021 and 0.023.
The Central Limit Theorem states that the distribution of the sample means for a given sample size of the population has the shape of a normal distribution. The mean of the sampling distribution is equal to the population mean and standard deviation is equal to the population standard deviation/ sqrt(n), where n is the sample size. As the sample size is increased, standard deviation decreases. Using attribute Age from our dataset, we show the application of Central Limit Theorem. Shown below are histograms for sample means of 250 random samples of sample size 10, 20, 30 and 40.
#hist(df2$Age)
agedf <- data.frame(df2$Age)
#agedf
sample23 <- 250
xbars1 = numeric(250)
xbars2 = numeric(250)
xbars3 = numeric(250)
xbars4 = numeric(250)
for (i in 1:250) {
xbars1[i] <- mean(sample(df2$Age,
size = 10, replace = TRUE))
}
agedf1 <- data.frame(xbars1)
#agedf1
p1 <- plot_ly(agedf1, x= ~agedf1$xbars1, type = 'histogram',histnorm = 'probability',color=I("blue"),opacity=0.4,name="Sample size:10")%>%layout(xaxis=list(range=c(40,65)),yaxis=list(title="Density",range=c(0,0.15)))
for (i in 1:250) {
xbars2[i] <- mean(sample(df2$Age,
size = 20, replace = TRUE))
}
agedf2 <- data.frame(xbars2)
p2 <- plot_ly(agedf2, x= ~agedf2$xbars2, type = 'histogram',histnorm = 'probability',color=I("salmon"),name="Sample size:20")%>%layout(xaxis=list(range=c(40,65)),yaxis=list(range=c(0,0.15)))
for (i in 1:250) {
xbars3[i] <- mean(sample(df2$Age,
size = 30, replace = TRUE))
}
agedf3 <- data.frame(xbars3)
p3 <- plot_ly(agedf3, x= ~agedf3$xbars3, type = 'histogram',histnorm = 'probability',color=I("greenyellow"),opacity=0.6, name="Sample size:30")%>%layout(xaxis=list(range=c(40,65)),yaxis=list(title="Density",range=c(0,0.15)))
for (i in 1:250) {
xbars4[i] <- mean(sample(df2$Age,
size = 40, replace = TRUE))
}
agedf4 <- data.frame(xbars4)
p4 <- plot_ly(agedf4, x= ~agedf4$xbars4, type = 'histogram',color=I("pink"),histnorm = 'probability',name="Sample size:40")%>%layout(xaxis=list(range=c(40,65)),yaxis=list(range=c(0,0.15)))
p <- subplot(p1, p2, p3, p4, nrows = 2, shareY = TRUE)%>%layout(title="Central Limit Theorem")
pcat("Sample Size = ", 10, " Mean = ", mean(xbars1),
" SD = ", sd(xbars1))## Sample Size = 10 Mean = 52.2988 SD = 3.969011
cat("Sample Size = ", 20, " Mean = ", mean(xbars2),
" SD = ", sd(xbars2))## Sample Size = 20 Mean = 52.1824 SD = 2.488106
cat("Sample Size = ", 30, " Mean = ", mean(xbars3),
" SD = ", sd(xbars3))## Sample Size = 30 Mean = 52.26093 SD = 2.003406
cat("Sample Size = ", 40, " Mean = ", mean(xbars4),
" SD = ", sd(xbars4))## Sample Size = 40 Mean = 52.4402 SD = 1.914103
As we can see, the mean of the sampling distribution remains almost the same as the overall population mean, whereas the standard deviation decreases as sample size is increased.
“The proof is in the purchase,” as the saying goes.
Purchase-based targeting (PBT) goes after consumers who have a purchase history with your offering. Statistics show that those who bought once are more likely to buy again. PBT is highly productive and so this question will help to know people of what age have opted for more deals than others and this will ultimately help strategize PBT of consumers.
plot_ly(df2,x=df2$NumDealsPurchases,y = ~df2$Age, type="box",opacity=0.6,color=I("darkcyan"))%>%layout(title="Boxplot for Number of deals and Age of Customer", xaxis=list(title="Number of deals",range=c(0,16)),yaxis=list(title="Age"))df2 %>%
plot_ly(x = ~Age, y = ~NumDealsPurchases,opacity=0.6,color=I("darkcyan")) %>%
add_markers(size = ~Income) %>% layout(title="Number of deals and Age",yaxis=list(title="Number of deals",range=c(0,16)),xaxis=list(title="Age"))#Stacked bar plot for kids and teens and the no of web purchases
g1 <- plot_ly(df2, x = ~NumWebPurchases, y = ~Teenhome, color=I("lightslategrey"),type = 'bar',opacity=0.6, name = 'Teens') %>%
add_trace(y = ~Kidhome, color=I("grey"),opacity=0.5,name = 'Kids') %>%
layout(title="Teens,kids and web purchases",yaxis = list(title = 'Teens/Kids'), barmode = 'stack',xaxis=list(title="Web purchases",range=c(0,14000)))
g1#Stacked bar plot for kids and teens on the no of in store purchases
s1 <- plot_ly(df2, x = ~NumStorePurchases, color=I("lightslategrey"),opacity=0.6, y = ~Teenhome, type = 'bar', name = 'Teens') %>%
add_trace(y = ~Kidhome, color=I("grey"),opacity=0.5,name = 'Kids') %>%
layout(title="Teens,kids and in store purchases",yaxis = list(title = 'Teens/Kids'), barmode = 'stack',xaxis=list(title="In store purchases",range=c(0,20000)))
s1df2%>%group_by(Kidhome)%>%summarise(sum(NumWebPurchases))
df2%>%group_by(Kidhome)%>%summarise(sum(NumStorePurchases))
df2%>%group_by(Teenhome)%>%summarise(sum(NumWebPurchases))
df2%>%group_by(Teenhome)%>%summarise(sum(NumStorePurchases))#All kinds of Purchases based on number of kids as a Grouped Bar Chart
fig1d <- plot_ly(data=df2, x = ~Kidhome, y = ~NumStorePurchases, type = 'bar', opacity=0.5,color = I("greenyellow"),name="No of In Store Purchases")
fig2d <- fig1d %>% add_trace(y = ~NumWebPurchases, name = 'Web Purchases', opacity=0.5,color = I("magenta"))
fig3d <- fig2d %>% add_trace(y = ~NumDealsPurchases, name = 'Deals Purchased', opacity=0.6,color = I("orange"))
fig4d <- fig3d %>% add_trace(y = ~NumCatalogPurchases, name = 'Catalog Purchases', opacity=0.5,color = I("salmon"))
fig5d <- fig4d %>% add_trace(y = ~NumWebVisitsMonth, name = 'Web Visits/Month', opacity=0.4,color = I("blue"))
fig6d <- fig5d %>% layout(title="Types of purchases based on number of kids",yaxis = list(title = 'Purchases',range=c(0,30000)),xaxis=list(title="Kids"), barmode = 'stack')
fig6d#All kinds of Purchases based on number of teens as a stacked bar chart
fig1d <- plot_ly(data=df2, x = ~Teenhome, y = ~NumStorePurchases, type = 'bar', opacity=0.5,color = I("greenyellow"),name="No of In Store Purchases")
fig2d <- fig1d %>% add_trace(y = ~NumWebPurchases, name = 'Web Purchases', opacity=0.5,color = I("magenta"))
fig3d <- fig2d %>% add_trace(y = ~NumDealsPurchases, name = 'Deals Purchased', opacity=0.6,color = I("orange"))
fig4d <- fig3d %>% add_trace(y = ~NumCatalogPurchases, name = 'Catalog Purchases', opacity=0.5,color = I("salmon"))
fig5d <- fig4d %>% add_trace(y = ~NumWebVisitsMonth, name = 'Web Visits/Month', opacity=0.4,color = I("blue"))
fig6d <- fig5d %>% layout(title="Types of purchases based on number of teens",yaxis = list(title = 'Purchases',range=c(0,25000)), barmode = 'stack',xaxis=list(title="Teens"))
fig6dTop 3 categories of purchases include:
Inferences Based on Purchase: 1. Investment in deals and catalog purchases was comparatively scarce. 2. Highest number of In store and catalog purchases were made by customers with no teens. 3. Highest number of Web and Deal purchases as well as web visits were made by customers with one teen. 4. Highest number of In store, catalog and web purchases were made by customers with no kids. 5. Highest number of monthly Web visits were made by customers with one kid. 6. Customers with two kids or two teens made the least number of purchases.
Our data spanned across 2.5 years (from 2012 to 2014). For 2012 and 2014, we only had data for the second and first half of the year respectively. We separated the date column to get year, month and day values and filtered it to get year wise data. Subsequently, we used group_by and summarise operations to get the sum of all web visits per month. We then arranged this data in ascending order based on months. Our aim was to see if there was any trend in web visits across months for three years.
df2 %>%
separate(Dt_Customer, sep="/", into = c("month","day","year"))->df3
df3%>%filter(year=="12")%>%group_by(month)%>%summarise(Webvisits=sum(NumWebVisitsMonth))->a
a$month<-as.numeric(a$month)
a<-a%>%arrange(month)
df3%>%filter(year=="13")%>%group_by(month)%>%summarise(Webvisits=sum(NumWebVisitsMonth))->b
b$month<-as.numeric(b$month)
b<-b%>%arrange(month)
df3%>%filter(year=="14")%>%group_by(month)%>%summarise(Webvisits=sum(NumWebVisitsMonth))->c
c$month<-as.numeric(c$month)
c<-c%>%arrange(month)
fig9 <- plot_ly(a, x = ~month, y = ~Webvisits, type = 'scatter', name="Year 2012",mode = 'lines',line=list(color="greenyellow",width=4))%>%layout(xaxis=list(range=c(1,12)),yaxis=list(range=c(0,800)))
fig9b <- plot_ly(b, x = ~month, y = ~Webvisits, type = 'scatter', name="Year 2013", mode= 'lines',line=list(color="orange",width=4))%>%layout(xaxis=list(range=c(1,12)),yaxis=list(range=c(0,800)))
fig9c<-plot_ly(c, x = ~month, y = ~Webvisits, type = 'scatter', name="Year 2014", mode = 'lines',line=list(color="pink",width=4))%>%layout(xaxis=list(range=c(1,12)),yaxis=list(range=c(0,800)))
fig9d <- subplot(fig9, fig9b, fig9c, nrows = 3) %>%
layout(title = list(text = "Lineplots for number of web visits per month"),
plot_bgcolor='#e5ecf6')
fig9d